Red Wine Exploratory Analysis by Nathan Myers

Introduction

This project uses R and data analysis techniques to explore a dataset on the quality of certain red wines. A modest report on the data can be found at Elsevier, in addition to a short description of the individual variables.

The dataset itself contains several physicochemical measures and attributes of red variants of the Portuguese wine “Vinho Verde”, classified by wine experts.

Univariate Plots

Taking a first look at the data:

## Observations: 1,599
## Variables: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <ord> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...

There are 12 variables and 1,599 observations. All variables are numerical except for the quality score, which is represented as an ordered factor.
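
As a rough illustration of what "ordered factor" means here, the sketch below rebuilds the first eight rows of three columns from the output above and converts `quality` by hand. The data frame name `wines_head` and the choice of levels 3 to 8 are assumptions made for this example; the report's own loading code is not shown.

```r
# First eight values of three columns, copied from the glimpse output above.
# This is only a fragment of the data, not the full 1,599 rows.
wines_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7)
)

# Assumption: the levels are the scores 3 to 8 that occur in the dataset.
wines_head$quality <- factor(wines_head$quality, levels = 3:8, ordered = TRUE)

str(wines_head)

# An ordered factor supports comparisons between levels.
wines_head$quality[8] > wines_head$quality[1]
```

Because the factor is ordered, comparisons such as "rated higher than 5" work directly on the column, which a plain factor would not allow.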

Quality

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The distribution of quality appears roughly normal, with most wines at average quality (5 or 6) and a few wines at either end. Interestingly, there are no wines with a rating below 3 or higher than 8.
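
To put numbers on how concentrated the ratings are, the counts from the table above can be turned into shares. This is a small sketch that only reuses the printed counts; the variable names are made up for the example.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

total <- sum(quality_counts)

# Share of each score, in percent.
shares <- round(100 * quality_counts / total, 1)
shares

# Share of wines rated 5 or 6.
mid_share <- sum(quality_counts[c("5", "6")]) / total
mid_share
```

Roughly 82% of the wines are rated 5 or 6, which is worth keeping in mind whenever the extreme classes are compared later on.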

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Most of the wines have a fixed acidity between 7.10 and 9.20 \(g/dm^3\), with a median of 7.90 \(g/dm^3\). The distribution is slightly skewed to the right. Also notable are several outliers.
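
One way to make "outliers" concrete is the 1.5 x IQR rule that box plots conventionally use. The sketch below applies it to the printed summary only; whether the plots in this report used exactly this rule is an assumption.

```r
# Five-number summary of fixed acidity, copied from the output above.
fixed_acidity <- c(min = 4.60, q1 = 7.10, median = 7.90, q3 = 9.20, max = 15.90)

iqr <- fixed_acidity[["q3"]] - fixed_acidity[["q1"]]

# Tukey fences: values beyond these would be drawn as outliers in a box plot.
lower_fence <- fixed_acidity[["q1"]] - 1.5 * iqr
upper_fence <- fixed_acidity[["q3"]] + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

# Is there at least one outlier on each side?
has_high_outliers <- fixed_acidity[["max"]] > upper_fence
has_low_outliers  <- fixed_acidity[["min"]] < lower_fence
c(high = has_high_outliers, low = has_low_outliers)
```

Under this rule the maximum lies above the upper fence while the minimum stays inside the lower one, so the outliers are all on the high side, consistent with the right skew.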

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile acidity is non-symmetric and bimodal with two peaks at 0.4 and 0.6. The median value is 0.52. Most observations within the data set fall in the range 0.39 - 0.64.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual sugar has a median value of 2.2 \(g/dm^3\) and is skewed to the right with a long tail.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The amount of chlorides in the wines has a median value of 0.079 \(g/dm^3\). The distribution appears normal around its main peak but has an unusually long right tail, with small counts of wines with values up to 0.611 \(g/dm^3\).

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free sulfur dioxide has a median value of 14 \(mg/dm^3\). It is right skewed with a long tail.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution of total sulfur dioxide is right skewed, with a median value of 38 \(mg/dm^3\). On the right tail we can see a local maximum near 80. There is a gap between 165 and 278, and only two wines have a concentration greater than or equal to 278; these are significant outliers around the 280 - 290 mark.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The density of the wines varies little, with most of the values between 0.9956 and 0.9978 \(g/cm^3\). The distribution is roughly symmetric and has a median value of 0.9968 \(g/cm^3\). The density is close to the density of water (1 \(g/cm^3\) at 4 \(^\circ C\)).

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

All wines have a low pH due to the fermentation process, making them acidic. This plot is the most ambiguous so far, as the distribution could be either symmetrical or bimodal depending on the interpretation. There seem to be two local maxima, one of them at about 3.35. The median value is 3.31, and most wines have a pH between 3.21 and 3.40.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is slightly skewed to the right. Some outliers on the right tail have around 2 \(g/dm^3\) of sulphates. The median value of sulphates is 0.62 \(g/dm^3\) and most wines have a concentration between 0.55 and 0.73 \(g/dm^3\).

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol concentration distribution is right skewed. The range of about 6.5 percentage points, with a minimum of 8.4%, is what we would expect for wine. The highest peak of the distribution is at 9.5% alcohol and the median value is 10.20%. The maximum amount of alcohol present in the dataset is 14.90%.

Univariate Analysis

The dataset has 12 variables and 1,599 observations. Each observation corresponds to a specific sample of red wine. Eleven variables correspond to the result of a physicochemical test and one variable (quality) corresponds to the result of a sensory panel rating by experts.

The main feature of interest is the correlation between various empirical measurements and quality. I am also interested in why many of the distributions have outliers.

There were no unusual distributions or missing values, and no need to adjust the data. The dataset was clean and ready for analysis.
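
The right skew noted for several variables above can be cross-checked from the printed summaries alone: when a distribution has a long right tail, the mean is usually pulled above the median. The sketch below compares the two for every physicochemical variable. It is only a heuristic, not a formal skewness measure, and the column name `rel_gap` is made up for this example.

```r
# Median and mean of each variable, copied from the summaries above.
summaries <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42)
)

# Relative gap between mean and median; larger positive values hint at
# a heavier right tail.
summaries$rel_gap <- (summaries$mean - summaries$median) / summaries$median

# Variables ordered from largest to smallest gap.
summaries[order(-summaries$rel_gap), c("variable", "rel_gap")]
```

Total sulfur dioxide and residual sugar come out on top, while density and pH show almost no gap, which matches the "roughly symmetric" reading of their plots.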

Bivariate Plots Section

Fixed Acidity vs. Quality

## [1] "Median of fixed.acidity by quality:"
## wines$quality: 3
## [1] 7.5
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 7.5
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 7.8
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 7.9
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 8.8
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 8.25

We see a slight upward trend between quality and fixed acidity. However, the extreme quality classes (3 and 8) have fewer occurrences, which may skew the median. There is also a slight drop in acidity from quality class 7 to 8. Additionally, we see a large dispersion of acidity values within each quality class. This may be an indicator that quality cannot be predicted based solely on acidity and is the result of a combination of variables.
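
The layout of the output above resembles what base R's `by()` prints. As a hedged sketch of how such per-class medians can be computed, the example below uses a hypothetical toy data frame `wines_toy` (the first eight rows shown earlier) in place of the full `wines` data; the report's actual code is not shown and may differ.

```r
# Hypothetical toy data standing in for the full `wines` data frame.
wines_toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3),
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7), ordered = TRUE)
)

# Prints one median per quality class, separated by dashed lines.
by(wines_toy$fixed.acidity, wines_toy$quality, median)

# The same medians as a named vector, plus the class sizes behind them.
medians <- tapply(wines_toy$fixed.acidity, wines_toy$quality, median)
counts  <- table(wines_toy$quality)

medians
counts
```

Printing the class sizes next to the medians makes it easier to see which medians rest on only a handful of wines, which is the caveat raised above for classes 3 and 8.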

Volatile Acidity vs. Quality

## [1] "Median of volatile.acidity by quality:"
## wines$quality: 3
## [1] 0.845
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.67
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.58
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.49
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.37
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.37

Compared to the fixed acidity plot, we can see a more obvious trend: lower volatile acidity seems to go with higher wine quality.
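
A quick way to back up "more obvious trend" is to check whether the medians move in one direction across the quality classes. The sketch below does this for the printed medians of both acidity measures. It works on six medians only, so it says nothing about the spread within each class.

```r
quality <- 3:8

# Medians per quality class, copied from the two outputs above.
median_volatile <- c(0.845, 0.67, 0.58, 0.49, 0.37, 0.37)
median_fixed    <- c(7.5, 7.5, 7.8, 7.9, 8.8, 8.25)

# Does the median never increase (volatile) or never decrease (fixed)
# from one class to the next?
volatile_monotone <- all(diff(median_volatile) <= 0)
fixed_monotone    <- all(diff(median_fixed) >= 0)

# Rank correlation between class and median, as a rough trend summary.
rho_volatile <- cor(quality, median_volatile, method = "spearman")

c(volatile = volatile_monotone, fixed = fixed_monotone)
rho_volatile
```

The volatile acidity medians fall at every step, while the fixed acidity medians break their upward pattern between classes 7 and 8.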

Citric Acid vs. Quality

## [1] "Median of citric.acid by quality:"
## wines$quality: 3
## [1] 0.035
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.09
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.23
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.26
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.4
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.42

Higher citric acid content is associated with higher quality wine. Citric acid is always present in low concentrations, and in the univariate plots we saw that the distribution peaked at zero.

What proportion of wines has zero citric acid?

## [1] 0.08255159

For each class the proportions are:

## # A tibble: 6 x 2
##   quality n_zero
##   <ord>    <dbl>
## 1 3       0.3   
## 2 4       0.189 
## 3 5       0.0837
## 4 6       0.0846
## 5 7       0.0402
## 6 8       0

We see a decreasing proportion of wines with zero citric acid in the higher quality classes.

This reinforces the first impression that higher citric acid concentration relates to higher quality wines.
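
As a consistency check, the per-class proportions above can be combined with the class sizes from the quality table to recover the overall share of wines with zero citric acid. The sketch below uses only printed numbers; rounding to whole wines is needed because the proportions are shown to three significant digits.

```r
# Class sizes from the quality table and zero-citric-acid proportions
# from the tibble above (quality 3 to 8).
class_size <- c(10, 53, 681, 638, 199, 18)
prop_zero  <- c(0.3, 0.189, 0.0837, 0.0846, 0.0402, 0)

# Implied number of wines with zero citric acid in each class.
n_zero <- round(class_size * prop_zero)
n_zero

# Overall proportion, to compare with the value printed earlier.
overall <- sum(n_zero) / sum(class_size)
overall
```

The implied counts add up to 132 of 1,599 wines, which reproduces the overall proportion of about 0.0826 reported above.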

Residual Sugar vs. Quality

## [1] "Median of residual.sugar by quality:"
## wines$quality: 3
## [1] 2.1
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 2.1
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 2.2
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 2.2
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 2.3
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 2.1

Residual sugar has little impact on the quality of the wine.

Chlorides vs. Quality

## [1] "Median of chlorides by quality:"
## wines$quality: 3
## [1] 0.0905
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.08
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.081
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.078
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.073
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.0705

There is a slight relationship between lower chloride levels and higher quality.

Free sulfur dioxide vs. Quality

## [1] "Median of free.sulfur.dioxide by quality:"
## wines$quality: 3
## [1] 6
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 11
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 15
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 14
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 11
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 7.5

The classes in the center of the distributions have a higher measurement of free sulfur dioxide than wines of high and low quality.

According to the dataset description, free SO2 is mostly undetectable at concentrations below roughly 50 ppm (about 50 mg/L). The following plot shows that very few wines are above this threshold, which leads us to believe that the variation is related not to the level of free SO2 but to the unbalanced distribution of wines across the quality classes.

Total sulfur dioxide vs. Quality

## [1] "Median of total.sulfur.dioxide by quality:"
## wines$quality: 3
## [1] 15
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 26
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 47
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 35
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 27
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 21.5

The relationship is almost identical to that of free sulfur dioxide: levels are highest in the central quality classes and lower at both extremes.

Density vs. Quality

## [1] "Median of density by quality:"
## wines$quality: 3
## [1] 0.997565
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.9965
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.997
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.99656
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.99577
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.99494

Lower density is associated with higher quality. The dataset description names alcohol and sugar content as the two most important factors determining density.

pH vs. Quality

## [1] "Median of pH by quality:"
## wines$quality: 3
## [1] 3.39
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 3.37
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 3.3
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 3.32
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 3.28
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 3.23

There is a slight relationship between acidity (low pH) and high quality. Does this mean quality wine should taste acidic? The next step is to check the correlations between pH and the acidity levels.

Sulphates vs. Quality

## [1] "Median of sulphates by quality:"
## wines$quality: 3
## [1] 0.545
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.56
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.58
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.64
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.74
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.74

There is a strong relationship between high sulphate levels and high quality wine.

Alcohol vs. Quality

## [1] "Median of alcohol by quality:"
## wines$quality: 3
## [1] 9.925
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 10
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 9.7
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 10.5
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 11.5
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 12.15

While average quality wine (rated 5) has slightly less alcohol than lower-rated wine, above average quality wine has a higher alcohol content.

Acidity and pH

This plot was done as a sort of control, as we would expect pH to increase as the amount of acid decreases. Fixed acidity accounts for most of the acid present in the wine.

## Warning: Transformation introduced infinite values in continuous y-axis

A similar relation is seen with citric acid, but the relationship is not nearly as strong. This can mostly be attributed to the fact that citric acid is only a small percentage of the acidic content.
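
The warning printed above is most likely caused by the wines with zero citric acid: on a log scale, zero has no finite position. The sketch below shows the effect on the first few citric acid values from the glimpse output. The plotting code itself is not shown in this report, so the exact source of the warning is an assumption.

```r
# First citric acid values, copied from the glimpse output near the top.
citric_acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06)

# log10(0) is -Inf, which a log-scaled axis cannot place.
logged <- log10(citric_acid)
logged

n_infinite <- sum(is.infinite(logged))
n_infinite

# One possible workaround (not necessarily what was done here):
# keep only the strictly positive values before taking logs.
finite_only <- log10(citric_acid[citric_acid > 0])
finite_only
```

Dropping the zeros keeps the axis well defined but removes about 8% of the wines from the plot, so the choice should be stated alongside any log-scaled figure.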

The volatile acidity seems to have either no relation with the pH or a slight positive correlation.

Correlation coefficient:

## 
##  Pearson's product-moment correlation
## 
## data:  pH and log10(volatile.acidity)
## t = 9.1468, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1760195 0.2691923
## sample estimates:
##       cor 
## 0.2231154

The correlation coefficient shows a weak positive correlation of volatile acidity with the pH. Maybe when the volatile acids are present in higher concentration, the concentration of the remaining acids is lower and that contributes to the increase of pH.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309
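
The t statistics and confidence intervals in the two test outputs above follow from the correlation coefficient and the sample size alone. The sketch below recomputes them with the standard formulas (a t statistic with n - 2 degrees of freedom and a Fisher z interval). The helper name `cor_summary` is made up for this example.

```r
# Recompute the t statistic and confidence interval of a Pearson
# correlation from r and n.
cor_summary <- function(r, n, conf = 0.95) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)

  # Fisher z transform: atanh(r) is approximately normal with
  # standard error 1 / sqrt(n - 3).
  z <- atanh(r)
  half_width <- qnorm(1 - (1 - conf) / 2) / sqrt(n - 3)
  ci <- tanh(c(z - half_width, z + half_width))

  list(t = t_stat, df = df, ci = ci)
}

# pH and log10(volatile.acidity)
ph_va <- cor_summary(0.2231154, 1599)

# fixed.acidity and volatile.acidity
fa_va <- cor_summary(-0.2561309, 1599)

ph_va
fa_va
```

With 1,599 observations even a correlation near 0.2 gives a very large t statistic, so the tiny p-values above indicate that the relationship is unlikely to be zero, not that it is strong.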

Sulphates and sulfur dioxide

Sulphate is an additive which can contribute to sulfur dioxide gas levels.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and sulphates
## t = 1.7178, df = 1597, p-value = 0.08602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006087119  0.091774762
## sample estimates:
##        cor 
## 0.04294684
## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and sulphates
## t = 2.0671, df = 1597, p-value = 0.03888
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.002643125 0.100424406
## sample estimates:
##        cor 
## 0.05165757

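There is a very weak relationship between sulfur dioxide levels and sulphates. The correlation with total sulfur dioxide is not statistically significant at the 5% level (p = 0.086), and the correlation with free sulfur dioxide is only marginally so (p = 0.039).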

Correlations with quality

##                             [,1]
## fixed.acidity         0.11408367
## volatile.acidity     -0.38064651
## citric.acid           0.21348091
## residual.sugar        0.03204817
## chlorides            -0.18992234
## free.sulfur.dioxide  -0.05690065
## total.sulfur.dioxide -0.19673508
## density              -0.17707407
## pH                   -0.04367193
## sulphates             0.37706020
## alcohol               0.47853169
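
To see at a glance which variables matter most, the coefficients above can be sorted by absolute value. The sketch below simply reuses the printed numbers. How `quality` was converted to a number before computing these correlations is not shown in this report.

```r
# Correlation of each variable with quality, copied from the output above.
quality_cor <- c(
  fixed.acidity        =  0.11408367,
  volatile.acidity     = -0.38064651,
  citric.acid          =  0.21348091,
  residual.sugar       =  0.03204817,
  chlorides            = -0.18992234,
  free.sulfur.dioxide  = -0.05690065,
  total.sulfur.dioxide = -0.19673508,
  density              = -0.17707407,
  pH                   = -0.04367193,
  sulphates            =  0.37706020,
  alcohol              =  0.47853169
)

# Strongest relationships first, regardless of sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
round(ranked, 3)
```

Alcohol, volatile acidity and sulphates lead the ranking, with citric acid a clear step behind; residual sugar and pH are at the bottom.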

Bivariate Analysis

The strongest relationships with wine quality are those of volatile acidity, citric acid, sulphates and alcohol content. The correlation coefficients show weaker relationships with the other variables.

Wines with average quality rankings, defined as 5 or 6, have higher levels of free and total sulfur dioxide than wines at the extremes.

I also looked at the relationship between pH and the acidity levels. The correlation coefficients show that the variable with the strongest relationship with quality is the wine’s alcohol content.

Multivariate Plots Section

Correlation Matrix

Constructing a correlation matrix:

## Warning: Deprecated, use tibble::rownames_to_column() instead.
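
For reference, a correlation matrix can be built with base R's `cor()` alone. The sketch below uses a hypothetical toy data frame `wines_toy` made of the first six rows of five columns from the glimpse output. It shows only the mechanics, since six rows are far too few for meaningful coefficients, and it is not the code that produced the warning above.

```r
# First six rows of five columns, copied from the glimpse output.
# Toy data only: the coefficients below are not those of the full dataset.
wines_toy <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(wines_toy, method = "pearson")

round(cor_matrix, 2)
```

The result is a symmetric matrix with ones on the diagonal. On the full data, the same call on the numeric columns would give the values that the plot visualises.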

Alcohol, volatile acidity and quality

Quality strongly correlates with alcohol and volatile acidity.

The plot shows that low quality wines have a low alcohol content paired with high volatile acidity. However, mid-range wines can be found all over the distribution.

Acidity, pH, quality

There is no pattern or relationship between quality and fixed acidity.

Citric acid, alcohol, quality

There is definitely a relationship between higher wine quality and an increase in citric acid and alcohol. However, there are also wines with average quality ratings with a wide range of citric acid levels and low alcohol content.

Alcohol and Sulphates

Alcohol and sulphates are weakly positively correlated. Additionally, higher alcohol content combined with higher levels of sulphates tends to go with higher quality wines.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and sulphates
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04477906 0.14196454
## sample estimates:
##        cor 
## 0.09359475
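
To judge how much this correlation matters in practice, the coefficient can be squared to give the share of variance the two variables have in common under a linear relationship. This is a small sketch using only the printed estimate.

```r
# Pearson correlation between alcohol and sulphates, from the output above.
r <- 0.09359475

# Squared correlation: proportion of variance shared under a linear fit.
r_squared <- r^2

# Expressed as a percentage.
pct_shared <- 100 * r_squared
pct_shared
```

Less than 1% of the variance is shared, so although the test is statistically significant, alcohol and sulphates behave almost independently of each other in this dataset.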

Multivariate Analysis

Emphasis was primarily focused on variables with strong relationships.

We have seen how alcohol and volatile acidity relate to quality. Higher alcohol and lower volatile acidity generally go with better quality wines.

Higher amounts of citric acid combined with higher alcohol content are found in the best wines.

We see a similar trend with sulphates: better quality wines have higher levels of alcohol and sulphates.


Final Plots and Summary

Acetic Acid vs Quality

Plot One

Description

This plot shows us the distribution of volatile acidity across different quality rankings. The scatterplot shows us the distribution of wines while the box plot displays the quantile boundaries and median values. We can also observe most wines have a rating between 5-6, and there are more wines ranked at the higher end of the scale, 7-8, than at the lower end. The red line connects the median values and helps us better visualize the inverse relationship between volatile acidity and wine quality.

Alcohol vs Quality

Description

Alcohol level has a large impact on the quality of wines at the higher end. For quality classes 3 to 5 the effect is limited, and quality is probably being steered by another variable, but from quality rating 5 to 8 we see a sharp increase in alcohol content. The general trend is that wines with higher alcohol content are rated higher in quality.

Plot Three

Description Three

This plot shows the combined effect of volatile acidity and alcohol on wine quality. Wines with both high volatile acidity and low alcohol content have a lower quality rating, and wines with low volatile acidity and low alcohol content have average ratings. Finally, wines with low volatile acidity and high alcohol content have the highest ratings.

Conclusion & Reflection

This project was a compelling opportunity to tie together all of my knowledge of R and learn the nuances of exploring a dataset. The dataset itself was assembled specifically for data analysis and machine learning techniques, so it was already very well organized and not missing any data points.

One of the most difficult challenges in working with this kind of dataset was choosing the direction in which to steer the analysis. Fortunately, the dataset description already points to some variables of interest, such as citric acid adding ‘freshness to wine’ or acetic acid adding a vinegary taste. I think this is an example of the significance of domain-specific knowledge and how it can help guide our analysis.

The second challenge I faced was interpreting a few of the multivariate plots. Adding a third dimension made it more difficult to view and interpret trends. Instead of relying on a line, you have to look closely at the changing colors. The correlation matrix was also helpful in finding the variables with the closest relationships and in exploring patterns.

As a next step, I think it would be interesting to add an analysis of white wines and make comparisons. We could also build a model to predict wine quality.